Traffic Violations EDA by Nour Galaby

This dataset contains traffic violation information from all electronic traffic violations issued in the County of Montgomery.

It contains violations from 2012 to 2016, more than 800,000 entries in total.

Let's see what we can find.

Data Summary:

##   Date.Of.Violation  Time.Of.Violation
##  3/17/2015 :  1281   23:20:00:  1218  
##  5/20/2014 :  1222   23:30:00:  1208  
##  11/24/2015:  1169   23:00:00:  1184  
##  12/8/2015 :  1147   22:53:00:  1179  
##  2/11/2015 :  1135   22:50:00:  1125  
##  5/6/2014  :  1128   22:57:00:  1125  
##  (Other)   :809616   (Other) :809659  
##                                                                                Violation.Description
##  DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS               : 64132   
##  FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER                       : 39614   
##  DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION                                   : 32341   
##  FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED POLICE ON DEMAND: 21027   
##  DRIVER FAILURE TO STOP AT STOP SIGN LINE                                                 : 18965   
##  (Other)                                                                                  :640618   
##  NA's                                                                                     :     1   
##                          Violation.Location    Latitude     
##  IS 370 @ IS 270                  :  1838   Min.   :-77.64  
##  W/B IS 370 @ IS 270              :  1829   1st Qu.: 39.01  
##  10901 WESTLAKE DRIVE             :  1404   Median : 39.06  
##  WAYNE AVE @ DALE DR              :  1278   Mean   : 28.43  
##  CLOPPER RD E/B @ ORCHARD HILLS DR:  1272   3rd Qu.: 39.13  
##  RT 28 @ BLACKBERRY DR            :  1217   Max.   : 77.04  
##  (Other)                          :807860   NA's   :72298   
##    Longitude                                     Geolocation    
##  Min.   :-94.61                                        : 72298  
##  1st Qu.:-77.18   (-76.9907366666667, 39.045425)       :   246  
##  Median :-77.08   (-76.91044, 39.109775)               :   211  
##  Mean   :-66.45   (-77.0271333333333, 38.9920483333333):   117  
##  3rd Qu.:-77.02   (39.109775, -76.91044)               :   116  
##  Max.   : 77.19   (39.0991266666667, -77.0421983333333):    75  
##  NA's   :72298    (Other)                              :743635  
##  Belts.Flag   Personal.Injury Property.Damage Commercial.License
##  No :787012   No :807397      No :802025      No :790389        
##  Yes: 29686   Yes:  9301      Yes: 14673      Yes: 26309        
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  Commercial.Vehicle Alcohol      Work.Zone    Violation.State 
##  No :810076         No :815021   No :816579   MD     :718892  
##  Yes:  6622         Yes:  1677   Yes:   119   VA     : 32483  
##                                               DC     : 19889  
##                                               XX     :  6621  
##                                               PA     :  5838  
##                                               FL     :  3545  
##                                               (Other): 29430  
##                 Vehicle.Type    Vehicle.Production.Year
##  02 - Automobile      :704614   Min.   :   0           
##  05 - Light Duty Truck: 50439   1st Qu.:2001           
##  28 - Other           : 17523   Median :2005           
##  03 - Station Wagon   : 14740   Mean   :2004           
##  06 - Heavy Duty Truck:  8391   3rd Qu.:2009           
##  01 - Motorcycle      :  7947   Max.   :9999           
##  (Other)              : 13044   NA's   :5143           
##  Vehicle.Manfacturer Vehicle.Model    Vehicle.Color    Caused.an.Accident
##  TOYOTA : 88638      4S     : 93512   BLACK  :158699   No :798368        
##  HONDA  : 83667      TK     : 55091   SILVER :148362   Yes: 18330        
##  FORD   : 78439      ACCORD : 28944   WHITE  :120937                     
##  TOYT   : 46659      CIVIC  : 26378   GRAY   : 84933                     
##  NISSAN : 41593      CAMRY  : 25807   RED    : 65285                     
##  (Other):477692      (Other):586915   BLUE   : 61060                     
##  NA's   :    10      NA's   :    51   (Other):177422                     
##  Gender            Driver.City      Driver.State   
##  F:271463   SILVER SPRING:200054   MD     :739032  
##  M:544214   GAITHERSBURG : 83880   VA     : 25038  
##  U:  1021   GERMANTOWN   : 66838   DC     : 24467  
##             ROCKVILLE    : 65780   PA     :  4431  
##             WASHINGTON   : 23623   FL     :  2868  
##             (Other)      :376478   NY     :  2510  
##             NA's         :    45   (Other): 18352

Univariate Plots Section

Exploring The Types of Vehicles

The data covers many vehicle types, from cars to trucks. Let's explore those types.

## 
##           01 - Motorcycle           02 - Automobile 
##                      7947                    704614 
##        03 - Station Wagon            04 - Limousine 
##                     14740                       575 
##     05 - Light Duty Truck     06 - Heavy Duty Truck 
##                     50439                      8391 
##   07 - Truck/Road Tractor 08 - Recreational Vehicle 
##                       869                      3157 
##         09 - Farm Vehicle          10 - Transit Bus 
##                        66                       280 
##    11 - Cross Country Bus           12 - School Bus 
##                        45                       129 
##            13 - Ambulance     13 - Ambulance(Emerg) 
##                         1                         5 
##            14 - Ambulance 14 - Ambulance(Non-Emerg) 
##                         2                         8 
##         15 - Fire Vehicle          15 - Fire(Emerg) 
##                         3                         4 
##      16 - Fire(Non-Emerg)        17 - Police(Emerg) 
##                         3                         3 
##       18 - Police Vehicle    18 - Police(Non-Emerg) 
##                         4                         7 
##                19 - Moped       20 - Commercial Rig 
##                       980                       408 
##       21 - Tandem Trailer          22 - Mobile Home 
##                        55                        18 
##  23 - Travel/Home Trailer               24 - Camper 
##                        17                        10 
##      25 - Utility Trailer         26 - Boat Trailer 
##                       856                        40 
##       27 - Farm Equipment                28 - Other 
##                        90                     17523 
##              29 - Unknown 
##                      5409

We can see from the table that Automobile is by far the most frequent type, which is what we would expect.

Let's see how it compares to the other types visually.

Log scale…
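To illustrate why a log scale helps here, the following is a minimal base R sketch using a handful of the counts copied from the table above. The original plotting code is not shown in this report, so the choice of `barplot` and `log10` is an assumption, not the method actually used.

```r
# A few counts copied from the Vehicle.Type table above (not the full table).
type_counts <- c(
  "02 - Automobile"       = 704614,
  "05 - Light Duty Truck" = 50439,
  "28 - Other"            = 17523,
  "03 - Station Wagon"    = 14740,
  "06 - Heavy Duty Truck" = 8391,
  "01 - Motorcycle"       = 7947,
  "13 - Ambulance"        = 1
)

# Share of all 816,698 rows that are "02 - Automobile" (about 86%).
automobile_share <- type_counts[["02 - Automobile"]] / 816698

# log10 compresses the range so the rare types stay comparable.
log_counts <- log10(type_counts)

# Horizontal bar chart on the log10 scale (a type with count 1 has height 0).
barplot(sort(log_counts), horiz = TRUE, las = 1, cex.names = 0.6,
        xlab = "log10(number of violations)")
```

On the raw scale the Automobile bar would dwarf every other type; on the log scale each step of 1 is a factor of ten.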

Top Models on the dataset

Top Colors of cars

Genders

We can see that females account for almost half as many violations as males in total.
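As a quick check of that ratio, here is a small sketch using the Gender counts from the data summary at the top of this report. The variable names are only illustrative.

```r
# Gender counts copied from the data summary above.
gender_counts <- c(F = 271463, M = 544214, U = 1021)

# Female-to-male ratio of violations.
female_to_male <- gender_counts[["F"]] / gender_counts[["M"]]

# Percentage share of each gender code, rounded to one decimal place.
gender_share <- round(100 * prop.table(gender_counts), 1)
```

The ratio comes out just under 0.5, and females account for roughly a third of all violations.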

Exploring the Dates and times

Perhaps the most interesting variable in the dataset: when do violations happen?

The data runs from 2012 to 2016. There may be some patterns, but they are not clear, and the plot is too noisy to draw any conclusions. Let's smooth it and try again.

Adding smoother

The default smoother doesn't help much because there are too many data points. Let's group by week, take the average over each week, and see.

Group by week

Much better. If you look closely, there may be a pattern here, but we will look into that shortly. Let's try grouping by month too.
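The weekly grouping can be sketched in base R as below. The data frame is a toy stand-in for the real dataset, and `cut(..., breaks = "week")` is one possible way to label weeks; the original code is not shown here, so treat this as an assumption.

```r
# Toy stand-in: one row per violation, dates in the dataset's m/d/Y format.
violations <- data.frame(
  Date.Of.Violation = c("1/2/2012", "1/2/2012", "1/3/2012", "1/9/2012",
                        "1/10/2012", "1/10/2012", "1/10/2012")
)
violations$Date_new <- as.Date(violations$Date.Of.Violation, format = "%m/%d/%Y")

# Step 1: number of violations per day.
daily <- as.data.frame(table(Date_new = violations$Date_new),
                       stringsAsFactors = FALSE)
daily$Date_new <- as.Date(daily$Date_new)

# Step 2: label each day with the Monday that starts its week.
daily$Date_week <- as.Date(cut(daily$Date_new, breaks = "week"))

# Step 3: average the daily counts within each week.
# Days with no violations do not appear in `daily`, so they are not averaged in.
weekly <- aggregate(Freq ~ Date_week, data = daily, FUN = mean)
```

Swapping `breaks = "week"` for `breaks = "month"` gives the monthly version of the same grouping.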

Grouping over each month

Now the pattern is clear. To make it even clearer, let's group by year and plot the years over each other.

Coloring Years

We can see that violations increase over the years, and there seems to be a certain time of year when violations peak.

Plotting years

Here we can see that May has the most violations of the year, followed by October. Could that be due to an increase in people who travel there in the summer? Or is it simply the start of summer, when people go out more?

In 2015 something was different, and the peak was no longer in May.

Let's group by week too.

Same for week

It's not very clear. Let's try something else.

From this plot we can clearly see how much each week differs from year to year.

Time

Let's make it clearer by taking the average over days for the same time.
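Averaging over days for the same time of day can be sketched like this. The data is a toy example with only two days and two distinct times, and the base R approach is an assumption rather than the code actually used.

```r
# Toy stand-in: two days, two distinct times of day.
violations <- data.frame(
  Date.Of.Violation = c("1/1/2012", "1/1/2012", "1/1/2012",
                        "1/2/2012", "1/2/2012"),
  Time.Of.Violation = c("23:20:00", "23:20:00", "08:00:00",
                        "23:20:00", "08:00:00")
)

# Count violations for every (day, time) pair; pairs that never occur get 0.
per_day_time <- as.data.frame(
  table(Date = violations$Date.Of.Violation,
        Time = violations$Time.Of.Violation),
  responseName = "n", stringsAsFactors = FALSE
)

# Average those counts over days, separately for each time of day.
avg_by_time <- aggregate(n ~ Time, data = per_day_time, FUN = mean)
```

Because `table` keeps the zero cells, a time with no violations on a given day pulls the average down, which is what an average over days should do.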

We can see from this plot at 6:00 AM most days the number of violations

Caused damage

We can see that only 3.9% of all violations caused personal injury or property damage.

Accidents that caused a problem (new variable)
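A combined flag like “Caused.Any.Damage” can be derived from the three Yes/No columns along these lines. This is a sketch on toy rows, assuming the flag is "Yes" whenever any of the three columns is "Yes"; the exact rule used in the analysis is not shown.

```r
# Toy stand-in with the three Yes/No columns from the dataset.
violations <- data.frame(
  Personal.Injury    = c("No", "Yes", "No", "No", "No"),
  Property.Damage    = c("No", "No", "Yes", "No", "No"),
  Caused.an.Accident = c("No", "Yes", "No", "No", "No")
)

# "Yes" if the violation caused an injury, property damage, or an accident.
violations$Caused.Any.Damage <- ifelse(
  violations$Personal.Injury == "Yes" |
    violations$Property.Damage == "Yes" |
    violations$Caused.an.Accident == "Yes",
  "Yes", "No"
)

# Percentage of violations that caused any damage (40% in this toy data).
damage_pct <- 100 * mean(violations$Caused.Any.Damage == "Yes")
```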

What is the structure of your dataset?

There are 816698 observations in the dataset, each representing a single violation, with 24 variables:

  • “Date.Of.Violation” : Date when the violation occurred, ex: 1/1/2012
  • “Time.Of.Violation” : Time when the violation happened, ex: 00:01:00
  • “Violation.Description” : Description of the violation in text
  • “Violation.Location” : The location name in text
  • “Latitude” : Latitude of the location, ex: 77.04796
  • “Longitude” : Longitude of the location, ex: 39.05742
  • “Geolocation” : Both Latitude and Longitude, ex: 77.1273633333333, 39.0908983333333
  • “Belts.Flag” : Whether the driver had a seat belt on at the time of the violation (Yes, No)
  • “Personal.Injury” : Whether any personal injury occurred as a result of the violation (Yes, No)
  • “Property.Damage” : Whether any property damage occurred as a result of the violation (Yes, No)
  • “Commercial.License” : Whether the driver has a commercial license (Yes, No)
  • “Commercial.Vehicle” : Whether the vehicle is a commercial vehicle (Yes, No)
  • “Alcohol” : Whether the driver was DUI at the time of the violation (Yes, No)
  • “Work.Zone” : Whether the violation happened in a work zone (Yes, No)
  • “Violation.State” : The state where the violation happened, ex: MD
  • “Vehicle.Type” : ex: Automobile, Truck, Motorcycle
  • “Vehicle.Production.Year” : ex: 1990
  • “Vehicle.Manfacturer” : ex: Toyota
  • “Vehicle.Model” : ex: COROLLA
  • “Vehicle.Color” : ex: Black, White
  • “Caused.an.Accident” : Whether the violation caused an accident (Yes, No)
  • “Gender” : Gender of the driver (M, F, U)
  • “Driver.City” : City of the driver, ex: BETHESDA
  • “Driver.State” : ex: MD

What is/are the main feature(s) of interest in your dataset?

The main features of interest are the date and time of the violation, and the damage it caused. I would like to see how violations are distributed over the year, and whether there is a certain period when a lot of violations happen.

Did you create any new variables from existing variables in the dataset?

I created 5 variables to help me group by date, and 3 more for time:
Date_new: the same date, but in Date format
Date_month: the first day of the month of the violation, ex: 2013-12-01
Date_week: similar to Date_month, but for weeks
month_only: the month number, ex: 12
week_only: the number of the week within the year when the violation happened
Time_new: the time in POSIXlt format
Tme_new2: the time in chr format
time_only: the time in factor format
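The date helpers could be derived roughly as follows. This is a sketch on three sample dates; the use of `cut` and of the `%U` week convention (weeks starting on Sunday) are assumptions, since the original derivation is not shown.

```r
# Three sample dates in the dataset's m/d/Y format.
dates <- c("1/1/2012", "12/25/2013", "5/20/2014")

# Date_new: the same date, parsed into R's Date class.
Date_new <- as.Date(dates, format = "%m/%d/%Y")

# Date_month / Date_week: first day of the containing month / week (Monday).
Date_month <- as.Date(cut(Date_new, breaks = "month"))
Date_week  <- as.Date(cut(Date_new, breaks = "week"))

# month_only / week_only: month number and week-of-year number.
month_only <- as.integer(format(Date_new, "%m"))
week_only  <- as.integer(format(Date_new, "%U"))  # assumed convention
```

Keeping Date_month and Date_week as real dates, rather than plain numbers, means they can go straight onto a time axis when plotting.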

I also created a “Caused.Any.Damage” variable to tell whether a violation caused any personal injury, property damage, or an accident (Yes, No).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Analysis

Bivariate Plots Section

Exploring The Location of Violations

Let's start by plotting the location of each violation.

We can see there are two main locations that the points are centered around.

  • plotting the two main parts one at a time, and avoiding outliers
  • adding alpha since the data is crowded

By zooming into the two main clusters, we can see they roughly form a map.
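Zooming into one cluster while dropping outliers can be sketched as a simple bounding-box filter. The box limits below are illustrative guesses for the area around the county, not values taken from this analysis, and the toy rows only mimic the kinds of values seen in the summary (missing coordinates, a far-away longitude, and a pair that looks swapped).

```r
# Toy stand-in mimicking the coordinate problems visible in the data summary.
violations <- data.frame(
  Latitude  = c(39.05, 39.11, -77.03, 39.00, NA, 39.00),
  Longitude = c(-77.10, -76.91, 38.99, -77.04, NA, -94.61)
)

# Assumed bounding box for one cluster (hypothetical limits).
lat_range <- c(38.9, 39.4)
lon_range <- c(-77.6, -76.8)

# Keep rows with known coordinates that fall inside the box.
in_box <- !is.na(violations$Latitude) & !is.na(violations$Longitude) &
  violations$Latitude  >= lat_range[1] & violations$Latitude  <= lat_range[2] &
  violations$Longitude >= lon_range[1] & violations$Longitude <= lon_range[2]
main_cluster <- violations[in_box, ]

# Semi-transparent points so that crowded areas show up darker.
plot(main_cluster$Longitude, main_cluster$Latitude, pch = 16,
     col = adjustcolor("black", alpha.f = 0.2),
     xlab = "Longitude", ylab = "Latitude")
```

The second cluster can be handled the same way with its own box.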

Let's see where accidents occur compared to where violations occur.

It roughly draws the map of the state, with major roads highlighted. Zooming in to see a closer picture…

Here we can also see that it highlights streets. If this were coupled with a map layer of the state, it would be much more interesting; however, I will not do this in this project.

Number of violations grouped by gender and month

## , ,  = 2012
## 
##    
##         1     2     3     4     5     6     7     8     9    10    11
##   F  3445  3669  4009  4631  5637  4366  3743  3699  3642  4364  4387
##   M  6729  6600  7535  8519 11751  8694  7944  8413  8414  8782  9196
##   U    79    73    80    20     9    12     7     4    11    13     8
##    
##        12
##   F  4135
##   M  8741
##   U     1
## 
## , ,  = 2013
## 
##    
##         1     2     3     4     5     6     7     8     9    10    11
##   F  4496  4286  5501  4960  5849  4593  5001  5498  5999  5903  5463
##   M  8575  8616 10350  9876 12564  8949 10731 11344 12127 11529 11553
##   U    27     7    10     6     4    12    16    50    44    16    11
##    
##        12
##   F  5324
##   M 11225
##   U     4
## 
## , ,  = 2014
## 
##    
##         1     2     3     4     5     6     7     8     9    10    11
##   F  4945  5246  6566  7429  7712  5982  6751  6031  6089  7087  6370
##   M 10455 10628 12989 13967 14873 11588 12816 11780 12310 13019 12398
##   U    61    14    17     9    16    18     7    10    16    44    28
##    
##        12
##   F  5578
##   M 10859
##   U     9
## 
## , ,  = 2015
## 
##    
##         1     2     3     4     5     6     7     8     9    10    11
##   F  6346  5539  7075  7021  6827  6507  6008  6739  6678  6802  6517
##   M 11891 10820 13668 14383 14068 12624 13411 14503 13278 13107 13368
##   U    10    36    20    14    17    24    10    71    15     9     5
##    
##        12
##   F  5886
##   M 11906
##   U    12
## 
## , ,  = 2016
## 
##    
##         1     2     3     4     5     6     7     8     9    10    11
##   F  5132     0     0     0     0     0     0     0     0     0     0
##   M 10748     0     0     0     0     0     0     0     0     0     0
##   U     5     0     0     0     0     0     0     0     0     0     0
##    
##        12
##   F     0
##   M     0
##   U     0

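For both genders, May usually has the peak: it does in 2012 and 2014 for both, and in 2013 for males (females peak in September). In 2015 neither gender peaks in May.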
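The peak month per gender can be read off programmatically. The sketch below uses the 2012 rows for F and M copied from the table above; the helper names are only illustrative.

```r
# 2012 monthly counts for F and M, copied from the table above.
counts_2012 <- rbind(
  F = c(3445, 3669, 4009, 4631, 5637, 4366, 3743, 3699, 3642, 4364, 4387, 4135),
  M = c(6729, 6600, 7535, 8519, 11751, 8694, 7944, 8413, 8414, 8782, 9196, 8741)
)
colnames(counts_2012) <- 1:12

# For each gender (row), find the month with the largest count.
peak_month <- apply(counts_2012, 1,
                    function(x) as.integer(names(which.max(x))))
```

For 2012 this gives month 5 (May) for both rows. Repeating it for the other years shows whether the peak stays in May.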

Vehicle type and Color

Some colors, like Copper and Chrome, appear only in Automobile, while other colors such as Black, White, and Gray are common across all types.

Violation Damage

In this part I will discuss the types of violations and how much damage they caused.

Most common violation types

## Source: local data frame [15 x 3]
## Groups: Violation.Description [15]
## 
##                                                          Violation.Description
##                                                                         <fctr>
## 1   DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC CONTROL DEVICE INSTRUCTIONS
## 2           FAILURE TO DISPLAY REGISTRATION CARD UPON DEMAND BY POLICE OFFICER
## 3                       DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGISTRATION
## 4  FAILURE OF INDIVIDUAL DRIVING ON HIGHWAY TO DISPLAY LICENSE TO UNIFORMED PO
## 5                                     DRIVER FAILURE TO STOP AT STOP SIGN LINE
## 6                                          OPERATOR NOT RESTRAINED BY SEATBELT
## 7                    DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE
## 8  PERSON DRIVING MOTOR VEHICLE ON HIGHWAY OR PUBLIC USE PROPERTY ON SUSPENDED
## 9  DRIVER USING HANDS TO USE HANDHELD TELEPHONE WHILEMOTOR VEHICLE IS IN MOTIO
## 10                                  EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH
## 11                                  EXCEEDING THE POSTED SPEED LIMIT OF 40 MPH
## 12 DRIVING VEHICLE ON HIGHWAY WITHOUT CURRENT REGISTRATION PLATES AND VALIDATI
## 13 FAILURE OF VEH. ON HWY. TO DISPLAY LIGHTED LAMPS, ILLUMINATING DEVICE IN UN
## 14                     EXCEEDING MAXIMUM SPEED: 39 MPH IN A POSTED 30 MPH ZONE
## 15 PERSON DRIVING MOTOR VEHICLE WHILE LICENSE SUSPENDED UNDER 17-106, 26-204, 
## # ... with 2 more variables: Caused.Any.Damage <chr>, n <int>
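A table like the one above can be built by counting each combination of description and damage flag and keeping the largest counts. The output above looks like it came from a grouped data frame; the sketch below uses base R on toy rows instead, so it is an equivalent approach rather than the original code.

```r
# Toy stand-in using three descriptions from the list above.
violations <- data.frame(
  Violation.Description = c(
    "DRIVER FAILURE TO STOP AT STOP SIGN LINE",
    "DRIVER FAILURE TO STOP AT STOP SIGN LINE",
    "DRIVER FAILURE TO STOP AT STOP SIGN LINE",
    "OPERATOR NOT RESTRAINED BY SEATBELT",
    "OPERATOR NOT RESTRAINED BY SEATBELT",
    "EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH"
  ),
  Caused.Any.Damage = c("No", "No", "Yes", "No", "No", "No")
)

# Count every (description, damage) combination.
counts <- as.data.frame(
  table(Violation.Description = violations$Violation.Description,
        Caused.Any.Damage = violations$Caused.Any.Damage),
  responseName = "n", stringsAsFactors = FALSE
)

# Drop combinations that never occur, sort by count, keep the top 15.
counts <- counts[counts$n > 0, ]
counts <- counts[order(-counts$n), ]
top_violations <- head(counts, 15)
```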

Plotting People that caused damage by date and gender

I notice something here: most violations are by males, but on the days when males commit few violations, females commit many, and vice versa. We can see it here in the spikes: a positive spike for males is often coupled with a negative spike for females. This should be looked at more closely.

It seems that most Alcohol violations for both men and women happen between 2 PM and 5 PM

Multivariate

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

In the plot of the date of violation against gender, I noticed that when there is a high number of violations by women there is usually a low number of violations by men, and vice versa. I did not expect that.


Final Plots and Summary

Plot One

Description One

This graph shows the count of violations in each minute. It shows when violations generally happen during the day.

Here are some things to notice about this graph:

  • The line is the weighted mean, calculated by passing a sliding window over the points.
  • From 00:00 to 8:00 the variance in the number of violations is very low (all points are close together).
  • Violations peak twice a day: at 7:00 and at 11:00 PM.
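A sliding-window mean of the kind mentioned in the first bullet can be sketched with `stats::filter`. The window width and equal weights here are assumptions, and the counts are made up for illustration; the smoother actually used for the plot may differ.

```r
# Made-up per-minute counts, in time order.
counts_per_minute <- c(2, 4, 6, 8, 10, 8, 6)

# Equal weights over a 3-point window (assumed width).
window  <- 3
weights <- rep(1 / window, window)

# Centered moving average; the two ends have no full window and become NA.
smoothed <- as.numeric(stats::filter(counts_per_minute, weights, sides = 2))
```

Unequal weights that sum to 1 would give more influence to the points nearest the center of the window.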

Plot Two

Description Two

  • From this plot we can see the number of violations increases over the years until it reaches a peak in 2014, then starts to come down in 2015.
  • May and October have the most violations in most years.

Plot Three

Description Three

This plot shows the locations of violations in one particular area, zoomed in. I chose it because the violations seem to draw the map of the streets.

You can tell the major streets just by looking at the violations, and it looks oddly like blood veins.

This plot may not convey a lot of information; however, I think it is very interesting, and that's why I chose it.


Reflection

The traffic violation data set contains information on more than 800,000 violations that occurred from 2012 to 2016. It shows how much violations increase through the years and at which times violations occur most often; I learned that May and October see the most violations.

I also used this data to find the most popular cars and models.

It seems this data could be used to help reduce violations and understand their causes, for example by analysing the locations where violations occur most often.

One struggle I had with this dataset is that most of its variables are categorical rather than continuous. This made it hard to derive insights and make comparisons, so I relied heavily on the count of violations as I grouped by each category. Even so, I found very interesting insights (for example in date/time and location).

One thing that could make it better, as future work, is combining this data with labeled map data so we can see clearly where the violations occur.

Also, the violation descriptions could be grouped into categories (e.g. speeding, ignoring traffic lights, reckless driving) and studied further to help reduce violations and accidents, and make traffic better for everyone.
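As a starting point for that grouping, here is a minimal keyword-based sketch. The category names and patterns are hypothetical choices, and `categorize` is an illustrative helper, not something used in this analysis.

```r
# Four descriptions copied from the violation tables earlier in this report.
descriptions <- c(
  "EXCEEDING THE POSTED SPEED LIMIT OF 30 MPH",
  "DRIVER FAILURE TO STOP AT STOP SIGN LINE",
  "OPERATOR NOT RESTRAINED BY SEATBELT",
  "DISPLAYING EXPIRED REGISTRATION PLATE ISSUED BY ANY STATE"
)

# Hypothetical helper: assign a coarse category by keyword match.
# Later rules overwrite earlier ones when more than one pattern matches.
categorize <- function(x) {
  category <- rep("other", length(x))
  category[grepl("REGISTRATION", x)] <- "registration"
  category[grepl("SEATBELT", x)] <- "seatbelt"
  category[grepl("STOP SIGN|TRAFFIC CONTROL DEVICE", x)] <- "signs and signals"
  category[grepl("SPEED", x)] <- "speeding"
  category
}

violation_category <- categorize(descriptions)
```

With a category column in place, the same counting and grouping used throughout this report can be applied per category instead of per raw description.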

References

http://stackoverflow.com/questions/3695497/ggplot-showing-instead-of-counts-in-charts-of-categorical-variables

Udacity Team